The first six weeks of the course cover data science and fundamental exploratory data analysis in R. It is a brief introduction to contemporary open-science, best-practice data and biostatistical working tools. The primary goal is exploration of tools and approaches associated with effective, efficient, reproducible biostatistical analyses. Inspirations include Software Carpentry, rforcats, and many more open, free, online resources I can direct you to as needed. The description provided by the university is a useful starting point in deciding if the Fall 2016 offering meets your specific needs.
use.ful
bio.stats
adventure.time
want data. have data. will collect data. assumption: in this course, you are here to work with data. data literacy IS data science.
Philosophy of R stats
Statistical thinking: likelihood, error, and effect sizes. Contemporary statisticians embrace and are mindful of these three key concepts in all research, data, and statistical inference.
Modes of inference with your data: using data, what can you infer or do?
1. Data description
2. Likelihood
3. Estimation
4. Bayesian inference - weight and include what we know
5. Prediction
6. Hypothesis testing
7. Decision making - balance gains and risks
This set of ideas provides the foundation for many data science and statistical approaches to working with evidence in almost every domain of science.
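Several of these modes can be illustrated in a few lines of base R. This is a minimal sketch on simulated data (the `heights` sample is made up for illustration), so the numbers themselves are not meaningful:

```r
# Simulated sample: 20 plant heights (cm) from a hypothetical population
set.seed(42)
heights <- rnorm(20, mean = 12, sd = 2)

# 1. Data description
mean(heights)
sd(heights)

# 3. Estimation: a 95% confidence interval for the mean
t.test(heights)$conf.int

# 6. Hypothesis testing: is the mean different from 10 cm?
t.test(heights, mu = 10)$p.value
```

The same few lines move from describing the sample, to estimating a population parameter, to testing a hypothesis about it.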
Data viz first and foremost. Case study #1.
#blog post by Fisseha Berhane
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
# Create four groups: setA, setB, setC, and setD
setA <- select(anscombe, x = x1, y = y1)
setB <- select(anscombe, x = x2, y = y2)
setC <- select(anscombe, x = x3, y = y3)
setD <- select(anscombe, x = x4, y = y4)
# Add a third column to identify the four groups
setA$group <- 'SetA'
setB$group <- 'SetB'
setC$group <- 'SetC'
setD$group <- 'SetD'
# Merge the four data sets
all_data <- rbind(setA, setB, setC, setD)
all_data[c(1, 13, 23, 43), ] # show a sample row from each group
## x y group
## 1 10 8.04 SetA
## 13 8 8.14 SetB
## 23 10 7.46 SetC
## 43 8 7.91 SetD
#compare summary stats
# Compare summary stats
summary_stats <- all_data %>% group_by(group) %>%
  summarize("mean x" = mean(x),
            "Sample variance x" = var(x),
            "mean y" = round(mean(y), 2),
            "Sample variance y" = round(var(y), 1),
            'Correlation between x and y ' = round(cor(x, y), 2))
models = all_data %>%
group_by(group) %>%
do(mod = lm(y ~ x, data = .)) %>%
do(data.frame(var = names(coef(.$mod)),
coef = round(coef(.$mod),2),
group = .$group)) %>%
dcast(., group~var, value.var = "coef")
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
summary_stats_and_linear_fit = cbind(summary_stats, data_frame("Linear regression" =
paste0("y = ",models$"(Intercept)"," + ",models$x,"x")))
summary_stats_and_linear_fit
## group mean x Sample variance x mean y Sample variance y
## 1 SetA 9 11 7.5 4.1
## 2 SetB 9 11 7.5 4.1
## 3 SetC 9 11 7.5 4.1
## 4 SetD 9 11 7.5 4.1
## Correlation between x and y Linear regression
## 1 0.82 y = 3 + 0.5x
## 2 0.82 y = 3 + 0.5x
## 3 0.82 y = 3 + 0.5x
## 4 0.82 y = 3 + 0.5x
#data viz instead as first step
ggplot(all_data, aes(x=x,y=y)) +geom_point(shape = 21, colour = "red", fill = "orange", size = 3)+
ggtitle("Anscombe's data sets")+geom_smooth(method = "lm",se = FALSE,color='blue') +
facet_wrap(~group, scales="free")
Outcome: we did the stats first and the data viz later (you were tricked). Descriptive estimates of data can be deceptive: all four Anscombe sets share the same means, variances, correlations, and regression lines, yet the plots are completely different. Draw first, then interpret.
Survey data from class. Case study #2.
#load class survey responses from google poll completed in class
survey<-read.csv("data/5081.survey.1.csv")
str(survey) #check data match what we collected
## 'data.frame': 18 obs. of 4 variables:
## $ r.experience : int 2 1 2 3 1 1 3 1 1 3 ...
## $ discipline : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
## $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ r.studio : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
#data viz
hist(survey$r.experience, xlab="experience in R (1 is none, 5 is extensive)", ylab="frequency", main="Likert Scale 1 to 5")
plot(survey$r.experience~survey$discipline, xlab="discipline", ylab="experience in R")
plot(survey$r.studio, ylab="R Studio")
plot(survey$research.data, ylab="Research data")
#observe patterns by checking plots
Observations from data viz
We have limited experience in R. Experience in R varies by research discipline. Half the respondents have tried R Studio. Most participants will be working with quantitative data in their own research projects.
#Now, try some simple summary statistics.
summary(survey)
## r.experience discipline research.data r.studio
## Min. :1.000 ecology :6 qualitative : 1 No :9
## 1st Qu.:1.000 environmental science:1 quantitative:17 Yes:9
## Median :1.000 genetics :2
## Mean :1.667 physiology :9
## 3rd Qu.:2.000
## Max. :3.000
#Data summary looks reasonable based on plots, mean R experience is < 2
t.test(survey$r.experience, mu=1) #t-test if mean is different from 1
##
## One Sample t-test
##
## data: survey$r.experience
## t = 3.3665, df = 17, p-value = 0.003664
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
## 1.248861 2.084472
## sample estimates:
## mean of x
## 1.666667
t.test(survey$r.experience, mu=2) #t-test if mean is different from 2
##
## One Sample t-test
##
## data: survey$r.experience
## t = -1.6833, df = 17, p-value = 0.1106
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
## 1.248861 2.084472
## sample estimates:
## mean of x
## 1.666667
#A one-sample t-test confirms we have a bit of experience in R: the mean differs from 1 but not from 2.
m1<-glm(r.experience~discipline, family = poisson, data = survey) #test for differences in R experience between disciplines
m1 #model summary
##
## Call: glm(formula = r.experience ~ discipline, family = poisson, data = survey)
##
## Coefficients:
## (Intercept) disciplineenvironmental science
## 5.108e-01 -5.108e-01
## disciplinegenetics disciplinephysiology
## 1.823e-01 -6.678e-10
##
## Degrees of Freedom: 17 Total (i.e. Null); 14 Residual
## Null Deviance: 6.808
## Residual Deviance: 6.371 AIC: 56.79
anova(m1, test="Chisq") #test whether the differences in model are different
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: r.experience
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 17 6.8075
## discipline 3 0.43692 14 6.3706 0.9325
#Too little evidence that R experience differs significantly between disciplines.
Practical skill outcomes of R stats useful for competency test
Understand the difference between R and R Studio. Use scripts or R Markdown files to save all your work. Be prepared to share your code. Load data, clean data, visualize data, then and only then begin applying statistics. Proximately: be able to use and work with dataframes, vectors, functions, and libraries in R.
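As a quick self-check against those competencies, here is a minimal base R sketch; all object names (`mass`, `critters`, `to_grams`) are made up for illustration:

```r
# A vector: an ordered set of values of one type
mass <- c(3.2, 4.1, 2.8)
species <- c("a", "b", "c")

# A dataframe: vectors of equal length bound into columns
critters <- data.frame(species, mass)

# A function: reusable logic with arguments and a return value
to_grams <- function(kg) kg * 1000
critters$mass.g <- to_grams(critters$mass)

# A library: loading attaches extra functions to your session
# library(dplyr)
str(critters)
```

If each of these four pieces makes sense, you are ready for the wrangling material below.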
workflows reproduce. openly.
data wrangling more than half the battle.
Philosophy of R stats
Tidy data make your life easier. Data structures should match intuition and common sense. Data should have logical structure: rows are observations, columns are variables. Tidy data also increase the likelihood that others can use your data, do better science, reuse science, and help you and your ideas survive and thrive. A workflow should also include the wrangling you did to get your data ready. If data are already very clean in a spreadsheet, they can easily become a literate, logical dataframe. Nonetheless, you should still use annotation within the introductory code to explain the meta-data of your data to some extent and what you did pre-R to get it tidy. The philosophy here is very similar to the forthcoming data viz lesson, with two dominant paradigms: base R functions, and pipes (%>%) with the logic embodied in the libraries of the tidyverse. Generally, I prefer the tidyverse because it is more logical and much easier to remember. It also has specific functions that address common errors in modern data structures with little code needed to fix them up.
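As a small illustration of the tidy-data idea (one row per observation, one column per variable), here is a sketch that reshapes a hypothetical wide table using reshape2, which we loaded earlier; the `counts` data are invented for illustration:

```r
library(reshape2)

# Hypothetical wide data: one column per year (untidy for most analyses)
counts <- data.frame(site = c("A", "B"),
                     y2015 = c(10, 14),
                     y2016 = c(12, 9))

# melt turns this into one row per site-by-year observation
tidy.counts <- melt(counts, id.vars = "site",
                    variable.name = "year", value.name = "count")
tidy.counts
```

The long (tidy) form is what ggplot2 and the dplyr group-and-summarise verbs expect.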
Workflow template for r-script
meta-data ####
author: cjlortie
date: 2016
purpose:

set-up ####
rm(list=ls())
getwd()
setwd("~/r")

read & check data ####
data <- read.csv("filename.csv")
names(data)
dim(data)
str(data)
summary(data)

visualize ####

check assumptions ####

model data and test hypothesis ####
Data wrangling
Base R key concepts: aggregate, tapply, sapply, lapply, subsetting, as.factor, is.numeric, NA handling
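A quick sketch of those base R tools on the built-in iris data; nothing here needs an extra package:

```r
# tapply: a summary statistic for one vector, split by a factor
tapply(iris$Sepal.Length, iris$Species, mean)

# aggregate: the same idea, returned as a dataframe
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# sapply: apply a function over columns (here, which are numeric?)
sapply(iris, is.numeric)

# subsetting plus an NA check
setosa <- iris[iris$Species == "setosa", ]
any(is.na(setosa))
```

These base idioms do the same jobs as the dplyr verbs below; knowing both helps you read other people's code.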
tidyverse key concepts: pipes are your best friend! %>%
dplyr verbs:
filter for rows
select for columns
mutate for new variables
summarise for bringing together many values
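Those four verbs chain naturally with the pipe. A minimal sketch on the built-in iris data (assuming dplyr is loaded):

```r
library(dplyr)

iris %>%
  filter(Species != "setosa") %>%            # filter: keep only some rows
  select(Species, Petal.Length) %>%          # select: keep only some columns
  mutate(log.petal = log(Petal.Length)) %>%  # mutate: add a new variable
  group_by(Species) %>%
  summarise(mean.petal = mean(Petal.Length)) # summarise: one row per group
```

Each verb takes a dataframe and returns a dataframe, which is what makes pipes work.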
Excellent list of wrangling tools
We will cover two basic challenges you will certainly encounter: missing data and selecting part of a dataset.
Missing data
#Missing data. In R, missing cells/observations in some measure can kick back an error. Other applications sometimes silently ignore missing values, which can itself introduce error.
ttc <-read.csv("data/ttc.csv")
str(ttc)
## 'data.frame': 31 obs. of 32 variables:
## $ FARE.MEDIA: Factor w/ 21 levels " BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
## $ X2015 : int NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
## $ X2014 : int NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
## $ X2013 : int NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
## $ X2012 : int NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
## $ X2011 : int NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
## $ X2010 : int NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
## $ X2009 : int NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
## $ X2008 : int NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
## $ X2007 : int NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
## $ X2006 : int NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
## $ X2005 : int NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
## $ X2004 : int NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
## $ X2003 : int NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
## $ X2002 : int NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
## $ X2001 : int NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
## $ X2000 : int NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
## $ X1999 : int NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
## $ X1998 : int NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
## $ X1997 : int NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
## $ X1996 : int NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
## $ X1995 : int NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
## $ X1994 : int NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
## $ X1993 : int NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
## $ X1992 : int NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
## $ X1991 : int NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
## $ X1990 : int NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
## $ X1989 : int NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
## $ X1988 : int NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
## $ X1987 : int NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
## $ X1986 : int NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
## $ X1985 : int NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
#check for missing values
is.na(ttc)
## FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008 X2007
## [1,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [4,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## [8,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [17,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## X2006 X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1998 X1997 X1996
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [17,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## X1995 X1994 X1993 X1992 X1991 X1990 X1989 X1988 X1987 X1986 X1985
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [8,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [16,] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [17,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [20,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [21,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [24,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
summary(ttc) #note: summary reports NA counts per column rather than excluding them
## FARE.MEDIA X2015 X2014 X2013
## CASH : 3 Min. : 10 Min. : 12 Min. : 401
## PRESTO : 3 1st Qu.: 1076 1st Qu.: 6087 1st Qu.: 7118
## TICKETS : 3 Median : 12170 Median : 10802 Median : 10850
## SUB-TOTAL : 3 Mean : 68974 Mean : 75560 Mean : 77843
## WEEKLY PASS: 2 3rd Qu.: 48634 3rd Qu.: 49120 3rd Qu.: 52732
## TWO-FARE : 2 Max. :534005 Max. :534815 Max. :525194
## (Other) :15 NA's :8 NA's :10 NA's :11
## X2012 X2011 X2010 X2009
## Min. : 372 Min. : 344 Min. : 0 Min. : 0
## 1st Qu.: 5141 1st Qu.: 4840 1st Qu.: 2539 1st Qu.: 4747
## Median : 11224 Median : 10690 Median : 9237 Median : 9844
## Mean : 76162 Mean : 74148 Mean : 67353 Mean : 69782
## 3rd Qu.: 51249 3rd Qu.: 49146 3rd Qu.: 43149 3rd Qu.: 46170
## Max. :514007 Max. :500219 Max. :477357 Max. :471233
## NA's :11 NA's :11 NA's :10 NA's :11
## X2008 X2007 X2006 X2005
## Min. : 310 Min. : 295 Min. : 58 Min. : 93
## 1st Qu.: 5334 1st Qu.: 4752 1st Qu.: 3978 1st Qu.: 3261
## Median : 11035 Median : 10892 Median : 10120 Median : 9606
## Mean : 72806 Mean : 71736 Mean : 65906 Mean : 63984
## 3rd Qu.: 49701 3rd Qu.: 62491 3rd Qu.: 61156 3rd Qu.: 63630
## Max. :466700 Max. :459769 Max. :444544 Max. :431220
## NA's :12 NA's :12 NA's :11 NA's :11
## X2004 X2003 X2002 X2001
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 4184 1st Qu.: 3805 1st Qu.: 3264 1st Qu.: 2800
## Median : 9940 Median : 10586 Median : 10404 Median : 10765
## Mean : 65379 Mean : 63412 Mean : 61538 Mean : 62471
## 3rd Qu.: 66028 3rd Qu.: 65337 3rd Qu.: 64830 3rd Qu.: 65670
## Max. :418099 Max. :405412 Max. :415539 Max. :419993
## NA's :12 NA's :12 NA's :11 NA's :11
## X2000 X1999 X1998 X1997
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2558 1st Qu.: 2370 1st Qu.: 2283 1st Qu.: 1834
## Median : 10593 Median : 9655 Median : 9530 Median : 9157
## Mean : 61079 Mean : 58365 Mean : 57806 Mean : 56543
## 3rd Qu.: 64276 3rd Qu.: 62428 3rd Qu.: 64004 3rd Qu.: 65008
## Max. :410558 Max. :392593 Max. :388689 Max. :379883
## NA's :11 NA's :11 NA's :11 NA's :11
## X1996 X1995 X1994 X1993
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1652 1st Qu.: 1976 1st Qu.: 2111 1st Qu.: 2507
## Median : 9098 Median : 11338 Median : 12192 Median : 12544
## Mean : 52806 Mean : 55031 Mean : 55096 Mean : 58668
## 3rd Qu.: 67164 3rd Qu.: 70369 3rd Qu.: 62700 3rd Qu.: 61242
## Max. :372430 Max. :388152 Max. :388252 Max. :393485
## NA's :10 NA's :10 NA's :10 NA's :11
## X1992 X1991 X1990 X1989
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1858 1st Qu.: 2904 1st Qu.: 2914 1st Qu.: 2388
## Median : 11385 Median : 14268 Median : 15488 Median : 15177
## Mean : 60409 Mean : 66712 Mean : 72297 Mean : 70940
## 3rd Qu.: 60610 3rd Qu.: 64236 3rd Qu.: 70048 3rd Qu.: 69216
## Max. :404251 Max. :424167 Max. :459234 Max. :450726
## NA's :11 NA's :12 NA's :12 NA's :12
## X1988 X1987 X1986 X1985
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2525 1st Qu.: 3608 1st Qu.: 3427 1st Qu.: 2519
## Median : 16369 Median : 14344 Median : 13996 Median : 15652
## Mean : 72982 Mean : 75921 Mean : 73280 Mean : 76044
## 3rd Qu.: 71772 3rd Qu.: 76750 3rd Qu.: 74446 3rd Qu.: 76829
## Max. :463475 Max. :456884 Max. :441012 Max. :432160
## NA's :12 NA's :13 NA's :13 NA's :14
new.ttc <-na.omit(ttc) # returns without missing values
is.na(new.ttc) #check to see if it worked
## FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010 X2009 X2008 X2007 X2006
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## X2005 X2004 X2003 X2002 X2001 X2000 X1999 X1998 X1997 X1996 X1995 X1994
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## X1993 X1992 X1991 X1990 X1989 X1988 X1987 X1986 X1985
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 11 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 14 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 15 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 18 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 19 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 22 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 25 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 26 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 27 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 31 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#many other solutions exist, but I use these two frequently
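Two of those other idioms are worth knowing: counting NAs per column, and keeping only complete rows explicitly. A sketch in base R only, on a small hypothetical dataframe standing in for data like ttc:

```r
# Hypothetical dataframe with gaps
df <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 3))

# Count missing values per column
colSums(is.na(df))           # x: 1, y: 1

# complete.cases flags rows with no missing values
complete <- df[complete.cases(df), ]
nrow(complete)               # only the (3, 3) row survives
```

`df[complete.cases(df), ]` does the same job as `na.omit(df)` but makes the row selection explicit, which can be clearer in a pipeline.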
Selecting part of a dataset
library(dplyr)
survey<-read.csv("data/5081.survey.1.csv")
str(survey)
## 'data.frame': 18 obs. of 4 variables:
## $ r.experience : int 2 1 2 3 1 1 3 1 1 3 ...
## $ discipline : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
## $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ r.studio : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
#I want just a simple dataframe with experience by discipline
experience <- survey %>% select(discipline, r.experience)
experience
## discipline r.experience
## 1 physiology 2
## 2 genetics 1
## 3 physiology 2
## 4 physiology 3
## 5 ecology 1
## 6 physiology 1
## 7 ecology 3
## 8 ecology 1
## 9 ecology 1
## 10 genetics 3
## 11 ecology 3
## 12 ecology 1
## 13 physiology 1
## 14 physiology 2
## 15 environmental science 1
## 16 physiology 1
## 17 physiology 1
## 18 physiology 2
#Now I just want to select the physiology folks
physiologists <- experience %>% filter(discipline == "physiology")
physiologists
## discipline r.experience
## 1 physiology 2
## 2 physiology 2
## 3 physiology 3
## 4 physiology 1
## 5 physiology 1
## 6 physiology 2
## 7 physiology 1
## 8 physiology 1
## 9 physiology 2
#Selections also often include a summary by levels, or making a new column with some calculations. Think about what you have likely done in Excel.
#use pipes to make a nice summary table
experience <- survey %>% group_by(discipline) %>% summarise(
  count = n(),
  exp = mean(r.experience)
)
#What if I just want to add a new column to my dataframe that is a sum or some other calculation?
ttc.5.years <- ttc %>% mutate(five.year.sum = X2015+X2014+X2013+X2012+X2011)
str(ttc.5.years)
## 'data.frame': 31 obs. of 33 variables:
## $ FARE.MEDIA : Factor w/ 21 levels " BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
## $ X2015 : int NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
## $ X2014 : int NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
## $ X2013 : int NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
## $ X2012 : int NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
## $ X2011 : int NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
## $ X2010 : int NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
## $ X2009 : int NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
## $ X2008 : int NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
## $ X2007 : int NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
## $ X2006 : int NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
## $ X2005 : int NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
## $ X2004 : int NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
## $ X2003 : int NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
## $ X2002 : int NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
## $ X2001 : int NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
## $ X2000 : int NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
## $ X1999 : int NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
## $ X1998 : int NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
## $ X1997 : int NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
## $ X1996 : int NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
## $ X1995 : int NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
## $ X1994 : int NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
## $ X1993 : int NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
## $ X1992 : int NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
## $ X1991 : int NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
## $ X1990 : int NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
## $ X1989 : int NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
## $ X1988 : int NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
## $ X1987 : int NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
## $ X1986 : int NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
## $ X1985 : int NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
## $ five.year.sum: int NA 577172 NA NA 36917 1033437 196787 NA 47839 236878 ...
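Note the NA values in five.year.sum above: arithmetic with + propagates missing values, so any fare type with a gap in 2011-2015 gets NA. If a sum over only the non-missing years is what you want, rowSums with na.rm = TRUE is one option. A minimal sketch on a toy dataframe, since the behavior is the point:

```r
# NA propagates through ordinary arithmetic
df <- data.frame(a = c(1, NA), b = c(2, 3))
df$a + df$b                               # 3, NA

# rowSums can skip the gaps instead
rowSums(df[, c("a", "b")], na.rm = TRUE)  # 3, 3
```

Whether skipping NAs is appropriate depends on what the missing values mean in your data; dropping them silently changes what the sum represents.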
str(ttc)
## 'data.frame': 31 obs. of 32 variables:
## $ FARE.MEDIA: Factor w/ 21 levels " BLIND/WAR AMPS",..: 19 13 12 16 10 11 7 14 15 2 ...
## $ X2015 : int NA 110945 NA NA 13323 204509 48396 NA 8843 48873 ...
## $ X2014 : int NA 111157 NA NA 9862 214932 42855 NA 9361 49120 ...
## $ X2013 : int NA 112360 NA NA 8194 213982 38426 NA 9557 48623 ...
## $ X2012 : int NA 117962 NA NA 4399 205086 35019 NA 10185 46467 ...
## $ X2011 : int NA 124748 NA NA 1139 194928 32091 NA 9893 43795 ...
## $ X2010 : int NA 120366 1298 NA 0 203101 9200 NA 9237 43149 ...
## $ X2009 : int NA 114686 8807 NA 0 208172 NA NA 8738 41445 ...
## $ X2008 : int NA 94210 34445 NA NA 203313 NA NA 7517 39408 ...
## $ X2007 : int NA 69134 65398 NA NA 195001 NA NA 7126 36317 ...
## $ X2006 : int NA 75340 68546 NA NA 171314 NA NA 5413 38684 ...
## $ X2005 : int NA 82162 73151 NA NA 140594 NA NA 1296 47521 ...
## $ X2004 : int NA 80859 72952 NA NA 125836 NA NA NA 55172 ...
## $ X2003 : int NA 80330 71485 NA NA 119681 NA NA NA 51328 ...
## $ X2002 : int NA 82102 74578 NA NA 116805 NA 762 NA 51052 ...
## $ X2001 : int NA 83771 70930 NA NA 118176 NA 1118 NA 58400 ...
## $ X2000 : int NA 82218 66331 NA NA 112081 NA 1105 NA 61539 ...
## $ X1999 : int NA 83028 64109 NA NA 103447 NA 1430 NA 54835 ...
## $ X1998 : int NA 85303 66490 NA NA 98473 NA 1649 NA 49658 ...
## $ X1997 : int NA 86991 66177 NA NA 91521 NA 1592 NA 46209 ...
## $ X1996 : int NA 87857 67164 4329 NA 86549 NA 1652 NA 32642 ...
## $ X1995 : int NA 87775 70369 12525 NA 96803 NA 1976 NA 20930 ...
## $ X1994 : int NA 97877 62700 13265 NA 96907 NA 2111 NA 20708 ...
## $ X1993 : int NA 104016 57710 12894 NA 100607 NA 2514 NA 22131 ...
## $ X1992 : int NA 114064 53655 11201 NA 109509 NA 2924 NA 23696 ...
## $ X1991 : int NA 111365 35788 17927 NA 108148 NA 3235 NA 60034 ...
## $ X1990 : int NA 119538 38369 22313 NA 116610 NA 2758 NA 67296 ...
## $ X1989 : int NA 114874 37401 27025 NA 113506 NA NA NA 65665 ...
## $ X1988 : int NA 122180 39514 18837 NA 119264 NA NA NA 66872 ...
## $ X1987 : int NA 127088 38944 8976 NA 109151 NA NA NA 75308 ...
## $ X1986 : int NA 126217 42052 7347 NA 101901 NA NA NA 66475 ...
## $ X1985 : int NA 128207 48793 NA NA 94970 NA NA NA 63986 ...
#so we made a new column
#we can do this on the original dataframe too, without assigning a new object
ttc %>% mutate(five.year.sum = X2015+X2014+X2013+X2012+X2011)
## FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010
## 1 ADULT NA NA NA NA NA NA
## 2 TOKENS 110945 111157 112360 117962 124748 120366
## 3 TICKETS NA NA NA NA NA 1298
## 4 TWO-FARE NA NA NA NA NA NA
## 5 PRESTO 13323 9862 8194 4399 1139 0
## 6 REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101
## 7 POST-SECONDARY PASS 48396 42855 38426 35019 32091 9200
## 8 TWIN-GO PASS NA NA NA NA NA NA
## 9 WEEKLY PASS 8843 9361 9557 10185 9893 9237
## 10 CASH 48873 49120 48623 46467 43795 43149
## 11 SUB-TOTAL 434889 437287 431142 419118 406594 386351
## 12 SENIOR/STUDENT NA NA NA NA NA NA
## 13 MONTHLY PASS 25092 23064 20509 19769 18590 17169
## 14 WEEKLY PASS 672 515 540 624 702 814
## 15 TICKETS 32595 33408 35472 37039 38299 38674
## 16 TWO-FARE NA NA NA NA NA NA
## 17 PRESTO 438 12 NA NA NA NA
## 18 CASH 12170 12037 8538 8164 7609 5856
## 19 SUB-TOTAL 70967 69036 65059 65596 65200 62513
## 20 CHILDREN NA NA NA NA NA NA
## 21 FREE RIDES 10939 NA NA NA NA NA
## 22 TICKETS 1066 7097 7563 7929 8304 8287
## 23 PRESTO 10 NA NA NA NA NA
## 24 CASH 526 3705 2708 2589 2433 2539
## 25 SUB-TOTAL 12541 10802 10271 10518 10737 10826
## 26 DAY/VIST./OTHER 8561 10033 11428 11929 10642 10605
## 27 BLIND/WAR AMPS 1086 1119 1109 1086 1060 1073
## 28 PREMIUM EXPRESS 490 451 401 372 344 322
## 29 POSTAL CARRIERS NA NA NA NA NA NA
## 30 GTA PASS 5471 6087 5784 5388 5642 5667
## 31 SYSTEM TOTAL 534005 534815 525194 514007 500219 477357
## X2009 X2008 X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 114686 94210 69134 75340 82162 80859 80330 82102 83771 82218
## 3 8807 34445 65398 68546 73151 72952 71485 74578 70930 66331
## 4 NA NA NA NA NA NA NA NA NA NA
## 5 0 NA NA NA NA NA NA NA NA NA
## 6 208172 203313 195001 171314 140594 125836 119681 116805 118176 112081
## 7 NA NA NA NA NA NA NA NA NA NA
## 8 NA NA NA NA NA NA NA 762 1118 1105
## 9 8738 7517 7126 5413 1296 NA NA NA NA NA
## 10 41445 39408 36317 38684 47521 55172 51328 51052 58400 61539
## 11 381848 378893 372976 359297 344724 334819 322824 325299 332395 323274
## 12 NA NA NA NA NA NA NA NA NA NA
## 13 15331 14864 14506 12931 11068 9940 10586 11123 12397 11785
## 14 874 780 686 372 93 0 0 0 0 0
## 15 38615 39097 40181 40808 42746 41562 41844 44018 44012 43885
## 16 NA NA NA NA NA NA NA NA NA NA
## 17 NA NA NA NA NA NA NA NA NA NA
## 18 5526 5253 4211 4581 6549 7602 6759 6440 7507 7921
## 19 60346 59994 59584 58692 60456 59104 59189 61581 63916 63591
## 20 NA NA NA NA NA NA NA NA NA NA
## 21 NA NA NA NA NA NA NA NA NA NA
## 22 8562 8782 8959 8879 8143 7573 7915 8869 9133 9401
## 23 NA NA NA NA NA NA NA NA NA NA
## 24 2410 2253 1933 2168 3916 4514 4072 3938 3999 4203
## 25 10972 11035 10892 11047 12059 12087 11987 12807 13132 13604
## 26 10880 9961 9636 9194 7598 6191 5817 9685 4943 4837
## 27 1074 1092 1094 1025 1026 1065 1113 1275 1223 1138
## 28 313 310 295 259 260 278 280 282 339 330
## 29 NA NA NA 58 702 700 664 683 719 753
## 30 5800 5415 5292 4972 4395 3855 3538 3927 3326 3031
## 31 471233 466700 459769 444544 431220 418099 405412 415539 419993 410558
## X1999 X1998 X1997 X1996 X1995 X1994 X1993 X1992 X1991 X1990
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 83028 85303 86991 87857 87775 97877 104016 114064 111365 119538
## 3 64109 66490 66177 67164 70369 62700 57710 53655 35788 38369
## 4 NA NA NA 4329 12525 13265 12894 11201 17927 22313
## 5 NA NA NA NA NA NA NA NA NA NA
## 6 103447 98473 91521 86549 96803 96907 100607 109509 108148 116610
## 7 NA NA NA NA NA NA NA NA NA NA
## 8 1430 1649 1592 1652 1976 2111 2514 2924 3235 2758
## 9 NA NA NA NA NA NA NA NA NA NA
## 10 54835 49658 46209 32642 20930 20708 22131 23696 60034 67296
## 11 306849 301573 292490 280193 290378 293568 299872 315049 336497 366884
## 12 NA NA NA NA NA NA NA NA NA NA
## 13 10124 9419 8647 9098 11028 10716 10508 10389 5850 5117
## 14 0 0 0 0 0 0 0 0 0 0
## 15 44263 46559 48306 52852 58515 56539 56625 57082 56559 61358
## 16 NA NA NA NA NA NA NA NA NA NA
## 17 NA NA NA NA NA NA NA NA NA NA
## 18 7481 7197 7665 8114 5202 4772 4703 3725 6030 6324
## 19 61868 63175 64618 70064 74745 72027 71836 71196 68439 72799
## 20 NA NA NA NA NA NA NA NA NA NA
## 21 NA NA NA NA NA NA NA NA NA NA
## 22 9186 9640 9667 10129 11338 12192 12193 11569 11722 12417
## 23 NA NA NA NA NA NA NA NA NA NA
## 24 4209 4344 4313 3674 2877 2717 2486 1860 2546 3071
## 25 13395 13984 13980 13803 14215 14909 14679 13429 14268 15488
## 26 5576 5236 4611 4234 4833 4764 4228 1853 2572 972
## 27 1131 1137 1057 1057 1059 1119 1122 1127 1093 1098
## 28 301 307 310 322 425 437 417 257 NA NA
## 29 790 783 902 1107 1122 1085 1331 1340 1298 1993
## 30 2683 2494 1915 1650 1375 343 NA NA NA NA
## 31 392593 388689 379883 372430 388152 388252 393485 404251 424167 459234
## X1989 X1988 X1987 X1986 X1985 five.year.sum
## 1 NA NA NA NA NA NA
## 2 114874 122180 127088 126217 128207 577172
## 3 37401 39514 38944 42052 48793 NA
## 4 27025 18837 8976 7347 NA NA
## 5 NA NA NA NA NA 36917
## 6 113506 119264 109151 101901 94970 1033437
## 7 NA NA NA NA NA 196787
## 8 NA NA NA NA NA NA
## 9 NA NA NA NA NA 47839
## 10 65665 66872 75308 66475 63986 236878
## 11 358471 366667 359467 343992 335956 2129030
## 12 NA NA NA NA NA NA
## 13 4570 4541 3855 3288 2519 107024
## 14 0 0 0 0 0 3053
## 15 61837 65708 65927 62963 61193 176813
## 16 93 63 NA NA NA NA
## 17 NA NA NA NA NA NA
## 18 6267 6359 7449 10852 13117 48518
## 19 72767 76671 77231 77103 76829 335858
## 20 NA NA NA NA NA NA
## 21 NA NA NA NA NA NA
## 22 12261 13194 12581 12074 10884 31959
## 23 NA NA NA NA NA NA
## 24 2916 3175 3526 3844 4768 11961
## 25 15177 16369 16107 15918 15652 54869
## 26 1312 641 612 649 630 52593
## 27 1139 1252 1593 1506 1272 5460
## 28 NA NA NA NA NA 2058
## 29 1860 1875 1874 1844 1821 NA
## 30 NA NA NA NA NA 28372
## 31 450726 463475 456884 441012 432160 2608240
ttc
## FARE.MEDIA X2015 X2014 X2013 X2012 X2011 X2010
## 1 ADULT NA NA NA NA NA NA
## 2 TOKENS 110945 111157 112360 117962 124748 120366
## 3 TICKETS NA NA NA NA NA 1298
## 4 TWO-FARE NA NA NA NA NA NA
## 5 PRESTO 13323 9862 8194 4399 1139 0
## 6 REGULAR MONTHLY PASS 204509 214932 213982 205086 194928 203101
## 7 POST-SECONDARY PASS 48396 42855 38426 35019 32091 9200
## 8 TWIN-GO PASS NA NA NA NA NA NA
## 9 WEEKLY PASS 8843 9361 9557 10185 9893 9237
## 10 CASH 48873 49120 48623 46467 43795 43149
## 11 SUB-TOTAL 434889 437287 431142 419118 406594 386351
## 12 SENIOR/STUDENT NA NA NA NA NA NA
## 13 MONTHLY PASS 25092 23064 20509 19769 18590 17169
## 14 WEEKLY PASS 672 515 540 624 702 814
## 15 TICKETS 32595 33408 35472 37039 38299 38674
## 16 TWO-FARE NA NA NA NA NA NA
## 17 PRESTO 438 12 NA NA NA NA
## 18 CASH 12170 12037 8538 8164 7609 5856
## 19 SUB-TOTAL 70967 69036 65059 65596 65200 62513
## 20 CHILDREN NA NA NA NA NA NA
## 21 FREE RIDES 10939 NA NA NA NA NA
## 22 TICKETS 1066 7097 7563 7929 8304 8287
## 23 PRESTO 10 NA NA NA NA NA
## 24 CASH 526 3705 2708 2589 2433 2539
## 25 SUB-TOTAL 12541 10802 10271 10518 10737 10826
## 26 DAY/VIST./OTHER 8561 10033 11428 11929 10642 10605
## 27 BLIND/WAR AMPS 1086 1119 1109 1086 1060 1073
## 28 PREMIUM EXPRESS 490 451 401 372 344 322
## 29 POSTAL CARRIERS NA NA NA NA NA NA
## 30 GTA PASS 5471 6087 5784 5388 5642 5667
## 31 SYSTEM TOTAL 534005 534815 525194 514007 500219 477357
## X2009 X2008 X2007 X2006 X2005 X2004 X2003 X2002 X2001 X2000
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 114686 94210 69134 75340 82162 80859 80330 82102 83771 82218
## 3 8807 34445 65398 68546 73151 72952 71485 74578 70930 66331
## 4 NA NA NA NA NA NA NA NA NA NA
## 5 0 NA NA NA NA NA NA NA NA NA
## 6 208172 203313 195001 171314 140594 125836 119681 116805 118176 112081
## 7 NA NA NA NA NA NA NA NA NA NA
## 8 NA NA NA NA NA NA NA 762 1118 1105
## 9 8738 7517 7126 5413 1296 NA NA NA NA NA
## 10 41445 39408 36317 38684 47521 55172 51328 51052 58400 61539
## 11 381848 378893 372976 359297 344724 334819 322824 325299 332395 323274
## 12 NA NA NA NA NA NA NA NA NA NA
## 13 15331 14864 14506 12931 11068 9940 10586 11123 12397 11785
## 14 874 780 686 372 93 0 0 0 0 0
## 15 38615 39097 40181 40808 42746 41562 41844 44018 44012 43885
## 16 NA NA NA NA NA NA NA NA NA NA
## 17 NA NA NA NA NA NA NA NA NA NA
## 18 5526 5253 4211 4581 6549 7602 6759 6440 7507 7921
## 19 60346 59994 59584 58692 60456 59104 59189 61581 63916 63591
## 20 NA NA NA NA NA NA NA NA NA NA
## 21 NA NA NA NA NA NA NA NA NA NA
## 22 8562 8782 8959 8879 8143 7573 7915 8869 9133 9401
## 23 NA NA NA NA NA NA NA NA NA NA
## 24 2410 2253 1933 2168 3916 4514 4072 3938 3999 4203
## 25 10972 11035 10892 11047 12059 12087 11987 12807 13132 13604
## 26 10880 9961 9636 9194 7598 6191 5817 9685 4943 4837
## 27 1074 1092 1094 1025 1026 1065 1113 1275 1223 1138
## 28 313 310 295 259 260 278 280 282 339 330
## 29 NA NA NA 58 702 700 664 683 719 753
## 30 5800 5415 5292 4972 4395 3855 3538 3927 3326 3031
## 31 471233 466700 459769 444544 431220 418099 405412 415539 419993 410558
## X1999 X1998 X1997 X1996 X1995 X1994 X1993 X1992 X1991 X1990
## 1 NA NA NA NA NA NA NA NA NA NA
## 2 83028 85303 86991 87857 87775 97877 104016 114064 111365 119538
## 3 64109 66490 66177 67164 70369 62700 57710 53655 35788 38369
## 4 NA NA NA 4329 12525 13265 12894 11201 17927 22313
## 5 NA NA NA NA NA NA NA NA NA NA
## 6 103447 98473 91521 86549 96803 96907 100607 109509 108148 116610
## 7 NA NA NA NA NA NA NA NA NA NA
## 8 1430 1649 1592 1652 1976 2111 2514 2924 3235 2758
## 9 NA NA NA NA NA NA NA NA NA NA
## 10 54835 49658 46209 32642 20930 20708 22131 23696 60034 67296
## 11 306849 301573 292490 280193 290378 293568 299872 315049 336497 366884
## 12 NA NA NA NA NA NA NA NA NA NA
## 13 10124 9419 8647 9098 11028 10716 10508 10389 5850 5117
## 14 0 0 0 0 0 0 0 0 0 0
## 15 44263 46559 48306 52852 58515 56539 56625 57082 56559 61358
## 16 NA NA NA NA NA NA NA NA NA NA
## 17 NA NA NA NA NA NA NA NA NA NA
## 18 7481 7197 7665 8114 5202 4772 4703 3725 6030 6324
## 19 61868 63175 64618 70064 74745 72027 71836 71196 68439 72799
## 20 NA NA NA NA NA NA NA NA NA NA
## 21 NA NA NA NA NA NA NA NA NA NA
## 22 9186 9640 9667 10129 11338 12192 12193 11569 11722 12417
## 23 NA NA NA NA NA NA NA NA NA NA
## 24 4209 4344 4313 3674 2877 2717 2486 1860 2546 3071
## 25 13395 13984 13980 13803 14215 14909 14679 13429 14268 15488
## 26 5576 5236 4611 4234 4833 4764 4228 1853 2572 972
## 27 1131 1137 1057 1057 1059 1119 1122 1127 1093 1098
## 28 301 307 310 322 425 437 417 257 NA NA
## 29 790 783 902 1107 1122 1085 1331 1340 1298 1993
## 30 2683 2494 1915 1650 1375 343 NA NA NA NA
## 31 392593 388689 379883 372430 388152 388252 393485 404251 424167 459234
## X1989 X1988 X1987 X1986 X1985
## 1 NA NA NA NA NA
## 2 114874 122180 127088 126217 128207
## 3 37401 39514 38944 42052 48793
## 4 27025 18837 8976 7347 NA
## 5 NA NA NA NA NA
## 6 113506 119264 109151 101901 94970
## 7 NA NA NA NA NA
## 8 NA NA NA NA NA
## 9 NA NA NA NA NA
## 10 65665 66872 75308 66475 63986
## 11 358471 366667 359467 343992 335956
## 12 NA NA NA NA NA
## 13 4570 4541 3855 3288 2519
## 14 0 0 0 0 0
## 15 61837 65708 65927 62963 61193
## 16 93 63 NA NA NA
## 17 NA NA NA NA NA
## 18 6267 6359 7449 10852 13117
## 19 72767 76671 77231 77103 76829
## 20 NA NA NA NA NA
## 21 NA NA NA NA NA
## 22 12261 13194 12581 12074 10884
## 23 NA NA NA NA NA
## 24 2916 3175 3526 3844 4768
## 25 15177 16369 16107 15918 15652
## 26 1312 641 612 649 630
## 27 1139 1252 1593 1506 1272
## 28 NA NA NA NA NA
## 29 1860 1875 1874 1844 1821
## 30 NA NA NA NA NA
## 31 450726 463475 456884 441012 432160
#notice any errors? :)
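The NA propagation above is worth dwelling on: `+` returns NA whenever any operand is NA, which is why several five.year.sum entries came back NA. A minimal sketch (toy values, not the TTC data) of an NA-tolerant alternative using rowSums:

```r
# Toy data frame mimicking the ttc year columns (values are made up)
df <- data.frame(X2015 = c(10, NA), X2014 = c(20, 5),
                 X2013 = c(30, 5), X2012 = c(1, 5), X2011 = c(2, 5))

# The `+` version propagates NA; rowSums(na.rm = TRUE) skips missing cells
df$plus.sum <- df$X2015 + df$X2014 + df$X2013 + df$X2012 + df$X2011
df$narm.sum <- rowSums(df[, c("X2015", "X2014", "X2013", "X2012", "X2011")],
                       na.rm = TRUE)
df
```

Whether na.rm = TRUE is appropriate depends on whether a missing year should count as zero ridership; for the TTC fare-media data it often should not.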
Practical skill outcomes of R stats useful for competency test Check and address missing values. Grab part of a dataset. Use pipes to move and wrangle chunks of your dataset.
Basic plots: lattice, ggplot2. You need to see your data: see the trends and explore them using visuals.
Contemporary data viz for statistical analyses slide deck
Philosophy of R stats Clean, simple graphics are powerful tools in statistics (and in scientific communication). Tufte and others have shaped how data scientists and statisticians develop libraries, standards, and assumptions associated with graphical representations of data. Data viz must highlight differences, show underlying data structures, and provide insights into the specific research project. R is infinitely customizable in all these respects. There are at least two major current paradigms (there are more; these are the two dominant idea sets). Base R plots are simple, relatively flexible, and very easy. However, their grammar, i.e. their rules of coding, is not modern. ggplot2 and related libraries invoke a new, formal grammar of graphics that is more logical and more flexible, but divergent from base R code. It is worth the time to understand the differences and know when to use each.
The evolution of plotting in statistics, using R in particular, went from base R to lattice to ggplot2, with the ggvis universe the most recent arrival. Base R is certainly useful in some contexts, as are the lattice and latticeExtra libraries. However, ggplot2 now encompasses all these capacities with a much simpler grammar (i.e. rules and order). Nonetheless, you should be able to read base R plotting code and write some as well. The philosophy, or grammar, of modern graphics is well articulated and includes the following key principles.
The grammar of graphics: primacy of layers (simple first, then more complex), i.e. you build up your plots; data are mapped to aesthetic attributes and geometric objects; data first, then statistics, even in plots.
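To make the layer principle concrete, here is a minimal sketch building one plot a layer at a time with the built-in mtcars data (a teaching toy, not one of the case studies below):

```r
library(ggplot2)

# Layer 1: data mapped to aesthetic attributes (nothing drawn yet)
p <- ggplot(mtcars, aes(x = wt, y = mpg))
# Layer 2: a geometric object to render the mapped data
p <- p + geom_point()
# Layer 3: a statistical layer on top; data first, then statistics
p <- p + geom_smooth(method = "lm")
```

Printing `p` at any stage shows exactly what the layers so far produce, which is the point: you build up your plots.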
Disclaimer: I love the power of qplots.
Data viz case study #1.
library(ggplot2)
survey<-read.csv("data/5081.survey.1.csv")
str(survey)
## 'data.frame': 18 obs. of 4 variables:
## $ r.experience : int 2 1 2 3 1 1 3 1 1 3 ...
## $ discipline : Factor w/ 4 levels "ecology","environmental science",..: 4 3 4 4 1 4 1 1 1 3 ...
## $ research.data: Factor w/ 2 levels "qualitative",..: 2 2 2 2 2 2 2 1 2 2 ...
## $ r.studio : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 2 1 2 1 ...
plot(survey$r.experience) #hard to tell what is going on
qplot(r.experience, data=survey) #decided to make bins for me and count up
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#so, we know better and instead do a histogram using base graphics
#basic data viz for EDA
hist(survey$r.experience) #better
qplot(r.experience, data=survey, geom="histogram") #same as what is picked for us
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(r.experience, data=survey, geom="histogram", binwidth=0.5)
barplot(survey$r.experience) #confusing
qplot(r.experience, data=survey, geom="bar") #what, it is back!
#basic data viz for EDA but for interactions
plot(survey$discipline, survey$r.experience)
qplot(discipline, r.experience, data=survey) #not the same
qplot(discipline, r.experience, data=survey, geom="boxplot")
plot(survey$r.studio~survey$r.experience) #ugly
qplot(r.experience, r.studio, data=survey) #useless
qplot(r.studio, data=survey, weight = r.experience) #sweet new feature here
#ok, so you get it: the grammar differs, but the visuals are about the same for super quick, simple x or x~y plots. The grammar hints at the power that awaits, though.
Data viz case study #2.
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#a crazy number of observations; we need fewer. too many riches are not always good.
set.seed(1000)
dsmall<-diamonds[sample(nrow(diamonds), 100), ]
plot(dsmall$carat, dsmall$price)
qplot(carat, price, data=dsmall)
#ok no difference
#now let's see what we can do with qplot with a few bits added
#one little argument extra added
qplot(carat, price, data = dsmall, colour = color)
qplot(carat, price, data = dsmall, shape = cut)
#how about using data viz to even more thoroughly explore potential stats we could do.
#qplot - quick plot, thoughtful build of layers
qplot(carat, price, data = dsmall, geom = c("point", "smooth"))
#what about trying some stats on this now, at least from a viz philosophy
qplot(color, price / carat, data = dsmall, geom = "boxplot") #can include formulas and methods
#or check for proportional differences
qplot(carat, data = dsmall, geom = "histogram", fill = color) #to see proportions
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(carat, data = dsmall, geom = "histogram", weight = price) # weight by a covariate
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#final idea, how about subsetting with the data right in the code for the plots!
qplot(carat, data = diamonds, facets = color ~ .,
geom = "histogram", binwidth = 0.1, xlim = c(0, 3)) #to compare between groups
## Warning: Removed 32 rows containing non-finite values (stat_bin).
#qplot is so powerful.
#colour, shape and size have meaning in the R code from this library
#layers added for you by qplots
#qplot handles two x and one y, or one x and two y, so more than two variables at once easily
Data viz case study #3.
#GGPLOT() gives you even more options for adding layers/options
p <- ggplot(mtcars, aes(x = mpg, y = wt))
p + geom_point()
#now play time with this case study.
#try out some geom options and different aesthetics and make some errors.
#prize for the prettiest plots
#displ is car engine size in Liters
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
#so aesthetics are one way to add variables in and expand your plotting power
#however facets are another way to make multiple plots BY a factor
#facet wrap is by one variable
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
#facet_wrap(~cell) - univariate: create a 1-d strip of panels, based on one factor, and wrap the strip into a 2-d matrix
#facet_grid(row~col) - (usually) bivariate: create a 2-d matrix of panels, based on two factors
#facet grid is by two variables
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)
#another example, with cleaner code
p <- ggplot(data = mpg, aes(x = displ, y = hwy, color = drv)) + geom_point()
p + facet_wrap(~cyl)
#or just use facets in qplot, which is much simpler
qplot(displ, hwy, data=mpg, facets = . ~ year) + geom_smooth()
Data viz case study #4.
#try it with ecological.footprints.csv
footprints <-read.csv("data/ecological.footprints.csv")
str(footprints)
## 'data.frame': 191 obs. of 3 variables:
## $ country : Factor w/ 145 levels "Albania","Algeria",..: 137 139 71 37 92 63 6 45 25 123 ...
## $ ecological.footprint: num 15.99 12.22 10.31 9.88 9.54 ...
## $ yr : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
#aha, R thinks year is an integer
footprints$yr <- as.factor(footprints$yr)
library(ggplot2)
qplot(country, ecological.footprint, data = footprints) #too messy
qplot(country, ecological.footprint, data = footprints, colour = yr) #better but still a lot to process
qplot(country, ecological.footprint, data = footprints, facets = yr~.) #ok but do not love. hard to see distribution
qplot(ecological.footprint, data = footprints) #you know what, this is not bad. maybe add year in too.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(ecological.footprint, data = footprints, fill = yr) #ok, now I am starting to see structure and differences
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#OK, now I am ready for stats. Thinking about these data, I see we have only two years for some countries, so I cannot do within-country or between-country contrasts. The most likely hypothesis I can test is whether ecological footprints increased between these two years. Not a perfect dataset, really, but we can compare these two years.
t.test(footprints$ecological.footprint~ footprints$yr)
##
## Welch Two Sample t-test
##
## data: footprints$ecological.footprint by footprints$yr
## t = 2.7386, df = 174.55, p-value = 0.00681
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2207405 1.3597701
## sample estimates:
## mean in group 2000 mean in group 2012
## 3.124255 2.334000
#ok, looks like there are differences between years, but it is hard to tell from the previous plot. I realize now that I need a better plot still.
qplot(yr, ecological.footprint, data = footprints, geom="boxplot") #this is weird, 2000 looks higher
#different countries between years?
#more countries reported in 2000?
library(dplyr)
footprints %>% count(yr)
## # A tibble: 2 × 2
## yr n
## <fctr> <int>
## 1 2000 141
## 2 2012 50
#Yup, way more data for year 2000
#maybe we should keep just the countries that were measured in both years?
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
matches <- spread(footprints, yr, ecological.footprint) %>% na.omit() #na.omit keeps only countries with values in both years
str(matches)
## 'data.frame': 46 obs. of 3 variables:
## $ country: Factor w/ 145 levels "Albania","Algeria",..: 1 2 4 7 9 18 28 30 31 34 ...
## $ 2000 : num 1.86 1.79 3.79 5.45 0.6 2.6 3.39 1.9 2.77 2.1 ...
## $ 2012 : num 1.8 1.6 2.7 5.3 0.7 2.9 3.2 1.8 2.5 1.9 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:99] 3 5 6 8 10 11 12 13 14 15 ...
## .. ..- attr(*, "names")= chr [1:99] "3" "5" "6" "8" ...
matches
## country 2000 2012
## 1 Albania 1.86 1.8
## 2 Algeria 1.79 1.6
## 4 Argentina 3.79 2.7
## 7 Austria 5.45 5.3
## 9 Bangladesh 0.60 0.7
## 18 Brazil 2.60 2.9
## 28 Chile 3.39 3.2
## 30 Colombia 1.90 1.8
## 31 Costa Rica 2.77 2.5
## 34 Cuba 2.10 1.9
## 40 Ecuador 2.26 2.4
## 42 El Salvador 1.55 2.0
## 46 France 7.27 4.9
## 48 Germany 6.31 4.6
## 51 Guatemala 1.40 1.8
## 56 Honduras 1.43 1.7
## 59 India 1.06 0.9
## 60 Indonesia 1.48 1.1
## 62 Iraq 1.73 1.4
## 64 Israel 5.40 4.0
## 66 Jamaica 2.68 1.7
## 67 Japan 5.94 4.2
## 68 Jordan 1.71 2.1
## 72 Kyrgyzstan 1.87 1.3
## 73 Laos 0.91 1.3
## 79 Madagascar 0.93 1.2
## 84 Mexico 2.69 3.3
## 85 Moldova 2.47 2.1
## 87 Morocco 1.56 1.3
## 92 New Zealand 9.54 4.3
## 93 Nicaragua 1.26 1.6
## 97 Norway 6.13 4.8
## 99 Pakistan 1.09 0.8
## 101 Panama 2.35 3.0
## 104 Peru 1.33 2.0
## 105 Philippines 1.42 1.0
## 121 Sri Lanka 0.95 1.2
## 124 Switzerland 6.63 5.0
## 125 Syria 2.56 1.5
## 126 Tajikistan 0.90 0.9
## 128 Thailand 2.70 2.4
## 132 Tunisia 2.27 1.8
## 133 Turkey 2.73 2.6
## 138 United Kingdom 6.29 4.7
## 142 Venezuela 2.88 3.0
## 143 Vietnam 0.95 1.4
#whoa, USA and Canada are missing, and we have HUGE footprints.
#got it: just the countries with measures in BOTH years.
#now, gather up again with these filtered matches!
new <- matches %>% gather(`2000`, `2012`, key = "yr", value ="ecological.footprint")
new #ok so now a nice dataframe with just matches back in a format I can use for plots and stats
## country yr ecological.footprint
## 1 Albania 2000 1.86
## 2 Algeria 2000 1.79
## 3 Argentina 2000 3.79
## 4 Austria 2000 5.45
## 5 Bangladesh 2000 0.60
## 6 Brazil 2000 2.60
## 7 Chile 2000 3.39
## 8 Colombia 2000 1.90
## 9 Costa Rica 2000 2.77
## 10 Cuba 2000 2.10
## 11 Ecuador 2000 2.26
## 12 El Salvador 2000 1.55
## 13 France 2000 7.27
## 14 Germany 2000 6.31
## 15 Guatemala 2000 1.40
## 16 Honduras 2000 1.43
## 17 India 2000 1.06
## 18 Indonesia 2000 1.48
## 19 Iraq 2000 1.73
## 20 Israel 2000 5.40
## 21 Jamaica 2000 2.68
## 22 Japan 2000 5.94
## 23 Jordan 2000 1.71
## 24 Kyrgyzstan 2000 1.87
## 25 Laos 2000 0.91
## 26 Madagascar 2000 0.93
## 27 Mexico 2000 2.69
## 28 Moldova 2000 2.47
## 29 Morocco 2000 1.56
## 30 New Zealand 2000 9.54
## 31 Nicaragua 2000 1.26
## 32 Norway 2000 6.13
## 33 Pakistan 2000 1.09
## 34 Panama 2000 2.35
## 35 Peru 2000 1.33
## 36 Philippines 2000 1.42
## 37 Sri Lanka 2000 0.95
## 38 Switzerland 2000 6.63
## 39 Syria 2000 2.56
## 40 Tajikistan 2000 0.90
## 41 Thailand 2000 2.70
## 42 Tunisia 2000 2.27
## 43 Turkey 2000 2.73
## 44 United Kingdom 2000 6.29
## 45 Venezuela 2000 2.88
## 46 Vietnam 2000 0.95
## 47 Albania 2012 1.80
## 48 Algeria 2012 1.60
## 49 Argentina 2012 2.70
## 50 Austria 2012 5.30
## 51 Bangladesh 2012 0.70
## 52 Brazil 2012 2.90
## 53 Chile 2012 3.20
## 54 Colombia 2012 1.80
## 55 Costa Rica 2012 2.50
## 56 Cuba 2012 1.90
## 57 Ecuador 2012 2.40
## 58 El Salvador 2012 2.00
## 59 France 2012 4.90
## 60 Germany 2012 4.60
## 61 Guatemala 2012 1.80
## 62 Honduras 2012 1.70
## 63 India 2012 0.90
## 64 Indonesia 2012 1.10
## 65 Iraq 2012 1.40
## 66 Israel 2012 4.00
## 67 Jamaica 2012 1.70
## 68 Japan 2012 4.20
## 69 Jordan 2012 2.10
## 70 Kyrgyzstan 2012 1.30
## 71 Laos 2012 1.30
## 72 Madagascar 2012 1.20
## 73 Mexico 2012 3.30
## 74 Moldova 2012 2.10
## 75 Morocco 2012 1.30
## 76 New Zealand 2012 4.30
## 77 Nicaragua 2012 1.60
## 78 Norway 2012 4.80
## 79 Pakistan 2012 0.80
## 80 Panama 2012 3.00
## 81 Peru 2012 2.00
## 82 Philippines 2012 1.00
## 83 Sri Lanka 2012 1.20
## 84 Switzerland 2012 5.00
## 85 Syria 2012 1.50
## 86 Tajikistan 2012 0.90
## 87 Thailand 2012 2.40
## 88 Tunisia 2012 1.80
## 89 Turkey 2012 2.60
## 90 United Kingdom 2012 4.70
## 91 Venezuela 2012 3.00
## 92 Vietnam 2012 1.40
qplot(yr, ecological.footprint, data = new, geom="boxplot") #replot using the matched countries, not the full footprints data
t.test(new$ecological.footprint~ new$yr, paired=TRUE)
##
## Paired t-test
##
## data: new$ecological.footprint by new$yr
## t = 2.7553, df = 45, p-value = 0.008434
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1121662 0.7217469
## sample estimates:
## mean of the differences
## 0.4169565
#Well, I give up: it seems the footprints for the world went down, not up, in this time period. GOOD NEWS for the environmental movement in some respects.
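The switch from the Welch test to the paired test matters here. A small sketch with made-up numbers shows why pairing by country gains power when between-country variation dwarfs the within-country change:

```r
set.seed(7)
# Made-up footprints: each "country" drops by about 0.4 between measurements
before <- c(5.0, 3.1, 2.0, 6.2, 1.1)
after  <- before - 0.4 + rnorm(5, sd = 0.05)

# Unpaired: between-country spread swamps the small consistent drop
t.test(before, after)$p.value
# Paired: the differences are nearly constant, so the drop is easy to detect
t.test(before, after, paired = TRUE)$p.value
```

Same numbers, very different p-values; that is exactly the matched-countries logic used above.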
new %>% group_by(yr) %>% summarise(mean(ecological.footprint))
## # A tibble: 2 × 2
## yr `mean(ecological.footprint)`
## <chr> <dbl>
## 1 2000 2.801739
## 2 2012 2.384783
# We still use 2.4 planets worth of resources but only have one.
#AND, we are missing some key countries that contribute to global change including Canada and USA.
Practical skill outcomes of R stats useful for competency test Make a meaningful plot of a single variable with nice labels. Make a meaningful plot of a relationship between two variables. Use qplot to make a plot that includes an aesthetic. Make a ggplot that includes three variables.
Physics example in Wired Magazine on why to plot data even if you have a model
Fundamental descriptive stats. Distributions via density plots. GLM. GLMM. Post hoc tests. modelr.
Philosophy of R stats Exploratory data analysis is everything we have done: a. Tidy data. b. Inspect data structure. c. Data viz. d. Basic exploratory data analyses.
However, now that we are ready to apply models, we add one more tiny step: visualize the data to better understand its shape and underlying distribution. Then you are ready to fit your models.
A statistical model is an elegant, representative simplification of the patterns you have identified through data viz and EDA. It should capture the data/experimental structure, including the key variables, appropriate levels, and relevant covariation or contexts that mediate outcomes. It should support the data viz. It should provide an estimate of the statistical likelihood or probability of differences. Ideally, the underlying coefficients should also be mined to convey an estimate of effect sizes. A t-test, chi-square test, linear model, general linear model, or generalized linear mixed model are all examples of models that describe and summarize patterns, and each carries assumptions about the data it embodies. Hence, the final pre-model-fit step is to explore distributions.
Conceptually, there are two kinds of models: those that look back and those that look forward. Think TARDIS or time machine. A model is always a snapshot taken with your time machine: a grab of what happened, or a future snap of what you predict. In R, there is simple code to time travel in either direction. Actually, there is no time (the arrow of time is only an observer's potential perception of it), and statistical models are our observers here. These observers use probability distributions, as described in the first week sensu statistical thinking, to calibrate what we think we observed or will observe given the evidence at hand.
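In code, the two directions of the time machine are the fitted model (backward) and predict() on new data (forward). A quick sketch with simulated toy data (an assumed slope of 2, not one of the sim datasets below):

```r
set.seed(42)
# Simulate data from a known relationship: y is roughly 2x plus noise
toy <- data.frame(x = 1:20)
toy$y <- 2 * toy$x + rnorm(20)

m <- lm(y ~ x, data = toy)                       # backward: describe what we observed
predict(m, newdata = data.frame(x = c(25, 30)))  # forward: x values we never observed
```

The fitted coefficients summarize the past; predict() projects the same snapshot onto evidence we do not yet have.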
Case study #1: x,y continuous
library(ggplot2)
library(dplyr)
library(modelr)
#Data viz for pattern
qplot(x,y, data = sim1)
#Now inspect distribution of y
ggplot(sim1) +
geom_density(mapping = aes(y))
ggplot(sim1) +
geom_histogram(mapping = aes(y), binwidth=5)
shapiro.test(sim1$y) #p-value < 0.05 would mean the data differ significantly from normal
##
## Shapiro-Wilk normality test
##
## data: sim1$y
## W = 0.97483, p-value = 0.6777
qqnorm(sim1$y) #qqplots are commonly used to inspect quantiles
#not significantly different from normal. great.
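As a sanity check on how shapiro.test behaves, here is a sketch on simulated draws where we know the answer (toy data, not sim1):

```r
set.seed(1)
normal.y <- rnorm(100)   # drawn from a normal distribution
skewed.y <- rexp(100)    # drawn from a clearly right-skewed distribution

shapiro.test(normal.y)$p.value  # large p: no evidence against normality
shapiro.test(skewed.y)$p.value  # tiny p: reject normality
```

With real data, treat the test as one piece of evidence alongside the density plot and qqplot, not a verdict on its own.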
#model time!
#Remember, you own the purpose! Does x predict y? Linear model straight up!
m1 <-lm(y~x, data = sim1)
summary(m1)
##
## Call:
## lm(formula = y ~ x, data = sim1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1469 -1.5197 0.1331 1.4670 4.6516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2208 0.8688 4.858 4.09e-05 ***
## x 2.0515 0.1400 14.651 1.17e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.203 on 28 degrees of freedom
## Multiple R-squared: 0.8846, Adjusted R-squared: 0.8805
## F-statistic: 214.7 on 1 and 28 DF, p-value: 1.173e-14
#wow, these simulated data are too good. Remember, this is the backwards, time-travel hypothesis test given what we have: a description of observed patterns, not a prediction of new data per se.
coef(m1) #remember 'minecraft' your model a bit to get a sense of effects.
## (Intercept) x
## 4.220822 2.051533
Case study #2 categorical x, continuous y
ggplot(sim2) +
geom_point(aes(x, y)) #categorical x
ggplot(sim2) +
geom_density(mapping = aes(y))
ggplot(sim2) +
geom_histogram(mapping = aes(y), binwidth=1) #changing binwidth really changes perception of distribution
shapiro.test(sim2$y)
##
## Shapiro-Wilk normality test
##
## data: sim2$y
## W = 0.92313, p-value = 0.009676
qqnorm(sim2$y) #non-normal
#so, we could do an ANOVA, but it likely does not link well to the underlying probability distribution
m2 <- aov(y~x, data = sim2)
summary(m2)
## Df Sum Sq Mean Sq F value Pr(>F)
## x 3 335.1 111.71 92.52 <2e-16 ***
## Residuals 36 43.5 1.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
names(m2)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
plot(m2$residuals) #residuals not bad
#library(fitdistrplus)
m3 <- glm(y~x, data = sim2, family = "quasi")
summary(m3) #better
##
## Call:
## glm(formula = y ~ x, family = "quasi", data = sim2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.40131 -0.43996 -0.05776 0.49066 2.63938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1522 0.3475 3.316 0.00209 **
## xb 6.9639 0.4914 14.171 2.68e-16 ***
## xc 4.9750 0.4914 10.124 4.47e-12 ***
## xd 0.7588 0.4914 1.544 0.13131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasi family taken to be 1.207495)
##
## Null deviance: 378.61 on 39 degrees of freedom
## Residual deviance: 43.47 on 36 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 2
#you can also explore distributions in even more detail to ensure correct model (correct = matches/describes underlying structure AND data distribution)
anova(m3, test="Chisq") #is x significant factor?
## Analysis of Deviance Table
##
## Model: quasi, link: identity
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 39 378.61
## x 3 335.14 36 43.47 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Cullen and Frey Graphs are cool
#fitdistrplus is not bad. Usually, the type of data (i.e. count, frequency, proportion) is just as effective for deciding on a family as distribution exploration. The goal is to fit the 'best' model: best is simplest and representative. Formal tools to contrast models sometimes help too.
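A sketch of the "type of data decides the family" rule of thumb, with made-up outcomes and intercept-only models just to show the syntax and the link functions:

```r
# Made-up count data -> poisson family (log link)
counts <- c(0, 1, 3, 2, 5, 8, 1, 0, 2, 4)
m.pois <- glm(counts ~ 1, family = poisson)
exp(coef(m.pois)[[1]])    # back-transformed intercept recovers the mean count

# Made-up binary outcomes -> binomial family (logit link)
successes <- c(1, 0, 1, 1, 0, 1, 0, 1, 1, 1)
m.binom <- glm(successes ~ 1, family = binomial)
plogis(coef(m.binom)[[1]])  # back-transformed intercept recovers the proportion
```

The back-transformations are the point: each family reports coefficients on its link scale, so interpreting effects means undoing the link.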
#note: for mixed models you can use the lme4 package when some effects are fixed and others are random. Need to think this over. Fixed effects = groups or levels in factors not due to random causes; random effects = likely from random causes or latent drivers such as population/species specificity.
Case study #3: interactions cat.x, cont.x, y
str(sim3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 120 obs. of 5 variables:
## $ x1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ x2 : Factor w/ 4 levels "a","b","c","d": 1 1 1 2 2 2 3 3 3 4 ...
## $ rep: int 1 2 3 1 2 3 1 2 3 1 ...
## $ y : Named num -0.571 1.184 2.237 7.437 8.518 ...
## ..- attr(*, "names")= chr "a" "a" "a" "b" ...
## $ sd : num 2 2 2 2 2 2 2 2 2 2 ...
ggplot(sim3) +
geom_point(aes(x1, y, colour = x2))
#x1 is continuous
#x2 is categorical
#Q: is there an effect of the xs on y, and do the effects interact, i.e. does the influence of x1 change with the level of x2?
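One quick way to eyeball that interaction question before modelling is to overlay a per-level linear fit; clearly diverging slopes across x2 levels suggest an x1:x2 term is worth fitting. A sketch, assuming ggplot2 and sim3 (from modelr) are loaded:

```r
library(ggplot2)
library(modelr)  # provides sim3

# one fitted line per level of x2; diverging slopes hint at an interaction
ggplot(sim3, aes(x1, y, colour = x2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```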
ggplot(sim3) +
geom_density(mapping = aes(y))
ggplot(sim3) +
geom_histogram(mapping = aes(y), binwidth=1)
shapiro.test(sim3$y)
##
## Shapiro-Wilk normality test
##
## data: sim3$y
## W = 0.97593, p-value = 0.0299
qqnorm(sim3$y)
#some deviation from normal but not bad. Could be contingent on the levels of x.
ggplot(sim3, aes(x1, y)) +
geom_boxplot(aes(colour = x2))
#so it looks like the distribution of y relates to the factors. Likely good to go with a parametric linear model.
m4 <-lm(y~x1*x2, data = sim3) #interactions terms for all levels
summary(m4)
##
## Call:
## lm(formula = y ~ x1 * x2, data = sim3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.87634 -0.67655 0.04837 0.69963 2.58607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30124 0.40400 3.221 0.00167 **
## x1 -0.09302 0.06511 -1.429 0.15587
## x2b 7.06938 0.57134 12.373 < 2e-16 ***
## x2c 4.43090 0.57134 7.755 4.41e-12 ***
## x2d 0.83455 0.57134 1.461 0.14690
## x1:x2b -0.76029 0.09208 -8.257 3.30e-13 ***
## x1:x2c 0.06815 0.09208 0.740 0.46076
## x1:x2d 0.27728 0.09208 3.011 0.00322 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 112 degrees of freedom
## Multiple R-squared: 0.8221, Adjusted R-squared: 0.811
## F-statistic: 73.93 on 7 and 112 DF, p-value: < 2.2e-16
m5 <-lm(y~x1+x2, data = sim3) #independent x1 & x2 effects on y modeled; no interaction term.
summary(m5)
##
## Call:
## lm(formula = y ~ x1 + x2, data = sim3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4674 -0.8524 -0.0729 0.7886 4.3005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87167 0.38738 4.832 4.22e-06 ***
## x1 -0.19674 0.04871 -4.039 9.72e-05 ***
## x2b 2.88781 0.39571 7.298 4.07e-11 ***
## x2c 4.80574 0.39571 12.145 < 2e-16 ***
## x2d 2.35959 0.39571 5.963 2.79e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.533 on 115 degrees of freedom
## Multiple R-squared: 0.5911, Adjusted R-squared: 0.5768
## F-statistic: 41.55 on 4 and 115 DF, p-value: < 2.2e-16
plot(m5) #sometimes I plot the model diagnostics to explore how well the model describes the patterns
anova(m5, test="Chisq") #tells you if effects are significant.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 38.32 38.319 16.314 9.718e-05 ***
## x2 3 352.07 117.358 49.966 < 2.2e-16 ***
## Residuals 115 270.11 2.349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#you can also contrast nested models directly by passing both to anova().
m.b <-anova(m4,m5, test="Chisq")
m.b
## Analysis of Variance Table
##
## Model 1: y ~ x1 * x2
## Model 2: y ~ x1 + x2
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 112 117.51
## 2 115 270.11 -3 -152.59 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
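Information criteria offer another quick way to contrast the two fits above; lower AIC/BIC favours a model after penalizing for complexity:

```r
# compare the interaction model (m4) against the additive model (m5);
# both AIC and BIC trade goodness of fit against number of parameters
AIC(m4, m5)
BIC(m4, m5)
```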
Case study #4: interactions cont.x, cont.x, y
str(sim4)
## Classes 'tbl_df', 'tbl' and 'data.frame': 300 obs. of 4 variables:
## $ x1 : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ x2 : num -1 -1 -1 -0.778 -0.778 ...
## $ rep: int 1 2 3 1 2 3 1 2 3 1 ...
## $ y : num 4.2477 1.206 0.3535 -0.0467 4.6387 ...
ggplot(sim4) +
geom_point(aes(x1, y, colour = x2))
ggplot(sim4) +
geom_density(mapping = aes(y))
ggplot(sim4) +
geom_histogram(mapping = aes(y), binwidth=1)
shapiro.test(sim4$y) #more divergence from normality
##
## Shapiro-Wilk normality test
##
## data: sim4$y
## W = 0.98703, p-value = 0.008507
qqnorm(sim4$y)
m6 <-glm(y~x1*x2, data = sim4)
summary(m6) #significant interaction term in model
##
## Call:
## glm(formula = y ~ x1 * x2, data = sim4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.9629 -1.4165 -0.1032 1.4284 4.9957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03546 0.11995 0.296 0.76772
## x1 1.82167 0.18792 9.694 < 2e-16 ***
## x2 -2.78252 0.18792 -14.807 < 2e-16 ***
## x1:x2 0.95228 0.29441 3.235 0.00136 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.316076)
##
## Null deviance: 2674.6 on 299 degrees of freedom
## Residual deviance: 1277.6 on 296 degrees of freedom
## AIC: 1296
##
## Number of Fisher Scoring iterations: 2
anova(m6, test="Chisq") #Looks solid.
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 299 2674.6
## x1 1 405.59 298 2269.0 < 2.2e-16 ***
## x2 1 946.29 297 1322.7 < 2.2e-16 ***
## x1:x2 1 45.16 296 1277.6 0.001218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m7 <-glm(y~x1+x2, data = sim4)
summary(m7) #the interaction term is missing here, and given the distribution exploration it is likely important.
##
## Call:
## glm(formula = y ~ x1 + x2, data = sim4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.5514 -1.3859 -0.1107 1.4928 4.7180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03546 0.12184 0.291 0.771
## x1 1.82167 0.19089 9.543 <2e-16 ***
## x2 -2.78252 0.19089 -14.577 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 4.453581)
##
## Null deviance: 2674.6 on 299 degrees of freedom
## Residual deviance: 1322.7 on 297 degrees of freedom
## AIC: 1304.5
##
## Number of Fisher Scoring iterations: 2
m.b2 <-anova(m6, m7, test="Chisq")
m.b2
## Analysis of Deviance Table
##
## Model 1: y ~ x1 * x2
## Model 2: y ~ x1 + x2
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 296 1277.6
## 2 297 1322.7 -1 -45.155 0.001218 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Case study #5: real data diamonds
#Lucy.
#See data viz week for first steps.
#real data always more complex
ggplot(diamonds) +
geom_point(aes(carat, price, colour = cut))
#so price is the most likely response we want to know about!
#two key predictors of different classes: cut is categorical and carat is continuous
#look at price distribution (likely need to do by each x)
ggplot(diamonds) +
geom_freqpoly(aes(price)) #long tail
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
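To silence the stat_bin() message, set an explicit binwidth; 500 here is an arbitrary but reasonable bin size for price in dollars:

```r
# explicit binwidth avoids the default `bins = 30` message
ggplot(diamonds) +
  geom_freqpoly(aes(price), binwidth = 500)
```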
#now do by the two xs to see how y varies
ggplot(diamonds, aes(carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1)
set.seed(1000)
dsmall<-diamonds[sample(nrow(diamonds), 1000), ]
shapiro.test(dsmall$price)
##
## Shapiro-Wilk normality test
##
## data: dsmall$price
## W = 0.78845, p-value < 2.2e-16
qqnorm(dsmall$price)
#ok, so we have a handle on the distribution.
#certainly non-normal.
#last idea!
#before we move to fitting a model (recognizing that the data look negative binomial or Poisson given the class), what if there is a relationship between the xs?
#what if larger diamonds cost more and better-cut diamonds cost more, but there are not many diamonds that are both large AND well cut? Covariation is not necessarily positive.
#EDA on just that
ggplot(diamonds, aes(cut, carat)) +
geom_boxplot()
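A numeric companion to the boxplot, summarising carat by cut with dplyr (assuming dplyr is loaded, as earlier in the course):

```r
library(dplyr)

# mean and median carat per cut level; n is the group size
diamonds %>%
  group_by(cut) %>%
  summarise(mean_carat = mean(carat),
            median_carat = median(carat),
            n = n())
```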
m8 <-glm(carat~cut, data=diamonds)
summary(m8) #looks different! Poorly cut diamonds tend to be larger...
##
## Call:
## glm(formula = carat ~ cut, data = diamonds)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8261 -0.3764 -0.1064 0.2580 3.9639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.859299 0.002983 288.033 < 2e-16 ***
## cut.L -0.203597 0.007991 -25.478 < 2e-16 ***
## cut.Q 0.038498 0.007123 5.405 6.52e-08 ***
## cut.C -0.135611 0.006199 -21.877 < 2e-16 ***
## cut^4 -0.045096 0.004998 -9.022 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.2167356)
##
## Null deviance: 12119 on 53939 degrees of freedom
## Residual deviance: 11690 on 53935 degrees of freedom
## AIC: 70604
##
## Number of Fisher Scoring iterations: 2
library(lsmeans)
## Loading required package: estimability
lsmeans(m8, "cut", adjust="tukey") # to see ls means
## cut lsmean SE df asymp.LCL asymp.UCL
## Fair 1.0461366 0.011602517 NA 1.0233961 1.0688772
## Good 0.8491847 0.006646628 NA 0.8361575 0.8622118
## Very Good 0.8063814 0.004235413 NA 0.7980801 0.8146826
## Premium 0.8919549 0.003964307 NA 0.8841850 0.8997248
## Ideal 0.7028370 0.003171257 NA 0.6966214 0.7090525
##
## Results are given on the identity (not the response) scale.
## Confidence level used: 0.95
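lsmeans can also run the pairwise contrasts directly, which is often what you actually want after inspecting the adjusted means; a sketch using the same m8 fit:

```r
# Tukey-adjusted pairwise comparisons between cut levels
pairs(lsmeans(m8, "cut"), adjust = "tukey")
```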
#ok now ready for a simple model.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## survey
## The following object is masked from 'package:dplyr':
##
## select
m9 <-glm.nb(price~carat*cut, data = diamonds)
summary(m9)
##
## Call:
## glm.nb(formula = price ~ carat * cut, data = diamonds, init.theta = 7.433464732,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.8981 -0.7526 -0.0835 0.5011 5.1836
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.290973 0.005136 1224.818 < 2e-16 ***
## carat 1.955494 0.004778 409.299 < 2e-16 ***
## cut.L -0.363561 0.014021 -25.930 < 2e-16 ***
## cut.Q 0.300973 0.012388 24.295 < 2e-16 ***
## cut.C -0.286111 0.010504 -27.238 < 2e-16 ***
## cut^4 -0.035146 0.008200 -4.286 1.82e-05 ***
## carat:cut.L 0.531523 0.012490 42.555 < 2e-16 ***
## carat:cut.Q -0.325109 0.011250 -28.900 < 2e-16 ***
## carat:cut.C 0.367635 0.010142 36.249 < 2e-16 ***
## carat:cut^4 0.086441 0.008432 10.251 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(7.4335) family taken to be 1)
##
## Null deviance: 392022 on 53939 degrees of freedom
## Residual deviance: 55084 on 53930 degrees of freedom
## AIC: 887545
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 7.4335
## Std. Err.: 0.0444
##
## 2 x log-likelihood: -887523.2510
anova(m9, test="Chisq")
## Warning in anova.negbin(m9, test = "Chisq"): tests made without re-
## estimating 'theta'
## Analysis of Deviance Table
##
## Model: Negative Binomial(7.4335), link: log
##
## Response: price
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 53939 392022
## carat 1 333900 53938 58122 < 2.2e-16 ***
## cut 4 843 53934 57280 < 2.2e-16 ***
## carat:cut 4 2195 53930 55084 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
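One hedged way to justify the negative binomial over a plain Poisson here: fit the Poisson analogue and check the dispersion. Residual deviance far exceeding the residual degrees of freedom indicates overdispersion, which the negative binomial accommodates:

```r
# Poisson analogue of m9; price treated as a count
m9.pois <- glm(price ~ carat * cut, data = diamonds, family = poisson)

# rough overdispersion check: a ratio near 1 is fine; much greater than 1
# means the Poisson mean = variance assumption is violated
m9.pois$deviance / m9.pois$df.residual
```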
Tips
- lme4 for mixed models
- vegan for ordinations
- lavaan for SEMs
- MASS for count and negative binomial data
- (1 | factor) treats the factor as a random effect
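A minimal mixed-model sketch with lme4, using its built-in sleepstudy data; the (1 | Subject) term fits a random intercept per subject:

```r
library(lme4)

# fixed effect of Days, random intercept for each Subject
m.mixed <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
summary(m.mixed)
```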
Ensure you see the different applications of the following models: anova, lm, glm, glmm.
EDA from a data science perspective: predictive
Practical skill outcomes of R stats useful for the competency test. Workflow description complete now.
Be able to use EDA & data viz to select a model. Be able to explore distributions of datasets. Be able to fit descriptive models (super simple to simple). Predictive models if you like too, but not required. Be able to examine the efficacy of a model.
Recognize through application of a few models that the following rule is never broken in stats…
Rule: Statistics are never prescriptive.
Processes include description or prediction. Models are powerful, purposeful tools you can use to capture & communicate evidence.
Homework Revisit survey, ttc, or footprints data and end with a model.
Additional readings
A practical guide to linear models
General how to choose the right test
Here is a webinar I like and also a good book chapter.
Tutorial on reading data into R
Great read on efficient data carpentry
Efficient statistics slide deck
Halloween Hackathon Costume options. Candy provided. Now we practice from start to finish, including submission of the r-script or rmd you write to turnitin.com
A graduate-level dataset. Apply your new mathemagical skills from scratch. A single three-hour block. As advanced as you want to be. Fundamental principles demonstrated. At least two fundamental statistical tests demonstrated.
Practical outcomes to demonstrate Rename some variables. Address missing values. Generate a simplified table/dataframe. Visualize the data to identify patterns and relationships. Produce a publication-quality graphic. Do EDA on the data very broadly. Do a single statistical analysis to capture the main effect observed. Annotate your code.
Rubric A total of 25%, so 5 questions each worth 5 points. Likert scale of 1 to 5, with 1 being low and 5 being awesome.
General best practices on scientific computing
I recommend printing them out. Also, scroll down on the page; the regex one is great and lists the gsub function.
Deliberate practice: tested with feedback. A virtual repeat of last week with new data, but evaluated. Pizza provided this week.
Prep suggestions 1. Learn how to use inner_join 2. Practice using the ‘file/knit document’ command within RStudio. [I am happy to review and accept only the R code, but as a backup a knit of the output including code as PDF is good. Also, knitted docs/pdfs/html pages are super for sharing your analyses with your graduate researcher collaborators or mentors]. This entire 6-week course is just a single RMD file knitted.
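A tiny inner_join practice sketch using two hypothetical data frames (samples and sites are made up for illustration, not course datasets):

```r
library(dplyr)

samples <- data.frame(id = c(1, 2, 3), mass = c(4.2, 5.1, 6.3))
sites   <- data.frame(id = c(2, 3, 4), site = c("A", "B", "C"))

# keeps only rows whose 'id' appears in both tables (here, id 2 and 3)
inner_join(samples, sites, by = "id")
```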